Abstract

Hierarchical statistical models are widely employed in information science 
and data engineering. The models consist of two types of variables: observable 
variables that represent the given data and latent variables for the unobserv- 
able labels. An asymptotic analysis of the models plays an important role in 
evaluating the learning process; the result of the analysis is applied not only 
to theoretical but also to practical situations, such as optimal model selection 
and active learning. There are many studies of generalization errors, which 
measure the prediction accuracy of the observable variables. However, the 
accuracy of estimating the latent variables has not yet been elucidated. For 
a quantitative evaluation of this, the present paper formulates distribution- 
based functions for the errors in the estimation of the latent variables. The 
asymptotic behavior is analyzed for both the maximum likelihood and the 
Bayes methods. 

Keywords: unsupervised learning, hierarchical parametric models, latent 
variable, maximum likelihood method, Bayes method 



1 Introduction 

Hierarchical probabilistic models, such as mixture models, are mainly employed in 
unsupervised learning. The models have two types of variables: observable and la- 
tent. The observable variables represent the given data, and the latent ones describe 
the hidden data-generation process. For example, in mixture models that are em- 
ployed for clustering tasks, observable variables are the attributes of the given data 
and the latent ones are the unobservable labels. 
One of the main concerns in unsupervised learning is the analysis of the hidden
processes, such as how to assign clustering labels based on the observations.
Hierarchical models have an appropriate structure for this analysis, because it is
straightforward to estimate the latent variables from the observable ones. Even
within the limits of the clustering problem, there is a great variety of ways to
detect unobservable labels, both probabilistically and deterministically, and many
criteria have been proposed to evaluate the results (Dubes & Jain, 1979). For
parametric models, the focus of the present paper, learning algorithms such as the 
expectation-maximization (EM) algorithm and the variational Bayes (VB) method 
(Attias, 1999; Ghahramani & Beal, 2000; Smidl & Quinn, 2005; Beal, 2003) have 
been developed for estimating the latent variables. These algorithms must estimate 
both the parameter and the variables, since the parameter is also unknown in the 
general case. 

Theoretical analysis of the models plays an important role in evaluating the 
learning results. There are many studies on predicting performance in situations 
where both training and test data are described by the observable variables. The 
results of asymptotic analysis have been used for practical applications, such as 
model selection and active learning (Akaike, 1974; Fedorov, 1972). The simplest case
for the analysis is the one in which the learning model can attain the true model,
which generates the data. Recently, it has been pointed out that when the learning
model has a redundant range or dimension of the latent variables, singularities exist
in the parameter space and the conventional statistical analysis is not valid (Amari
& Ozeki, 2001). To tackle this issue, a theoretical analysis of the Bayes method
was established using algebraic geometry (Watanabe, 2009). The generalization 
performance was then derived for various models (Yamazaki & Watanabe, 2003a; 
Yamazaki & Watanabe, 2003b; Rusakov & Geiger, 2005; Aoyagi, 2010; Zwiernik, 
2011). Based on this analysis of the singularities, some criteria for model selection 
have been proposed (Watanabe, 2010; Yamazaki et al., 2005; Yamazaki et al., 2006). 

Although validity of the learning algorithms is necessary for unsupervised tasks, 
statistical properties of the accuracy of the estimation of the latent variables have not 
been studied sufficiently. The goal of the present paper is to provide an asymptotic 
analysis for quantitative evaluation of the accuracy. As a first step, we consider
the simplest case, in which the attributes, such as the range and dimension, of the
latent variables are known; the true model has a minimal expression in terms of
the distribution function of the observable data; and there is no singularity in the
parameter space. The main contributions of the present paper are the following
three items: (1) various types of estimation for the latent variables and their error
functions are formulated in a distribution-based manner; (2) the asymptotic forms of
the error functions are derived for the maximum likelihood and the Bayes methods;
(3) it is determined that the Bayes method is more accurate than the maximum
likelihood method in the asymptotic situation.
The rest of this paper is organized as follows: In Section 2 we explain the estimation
of latent variables by comparing it with the prediction of observable variables.
In Section 3 we provide the formal definitions of the estimation methods and the er- 
ror functions. Section 4 then presents the main results for the asymptotic forms and 
the proofs. Discussions and conclusions are stated in Sections 5 and 6, respectively. 

2 Estimations of Variables 

This section distinguishes between the estimation of latent variables and the
prediction of observable variables. The estimation of latent variables has several
variants, depending on the estimation target.

Assume that the observable data and unobservable labels are represented by the
observable variables x and the latent variables y, respectively. A set of n independent
data pairs is expressed as $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. More precisely, there is no
dependency between $x_i$ and $x_j$ or between $y_i$ and $y_j$ for $i \ne j$.

Figure 1 shows a variety of estimations of variables: prediction of an observable 
variable and three types of estimations of latent variables. Solid and dotted nodes 
are the observable and latent variables, respectively. A data pair is depicted by a 
connection between two nodes. The gray nodes are the target items of the estimations.
We consider a stochastic approach, where the probability distribution of the
target(s) is estimated from the training data $\{x_1, \ldots, x_n\}$.

The top-left panel shows the prediction of unseen observable data. Based on
$\{x_1, \ldots, x_n\}$, the next observation $x = x_{n+1}$ is predicted. The top-right panel
shows the estimation of $\{y_1, \ldots, y_n\}$, which is referred to as Type I. In the stochastic
approach, the joint probability of $\{y_1, \ldots, y_n\}$ is estimated. The bottom-left panel
shows marginal estimation, referred to as Type II. The marginal probability of $y_i$
($y_1$ is the example in the figure) is estimated; the rest of the latent variables in
the probability are marginalized out. Note that there is no unseen/future data in
either Type I or Type II. The bottom-right panel shows estimation of y in the unseen
data, which is referred to as Type III. The difference between this and Type II is
the training data; the corresponding observable part of the target is included in the
training set in Type II, but it is not included in Type III. In the present paper we
use a distribution-based approach to analyze the theoretical accuracy of a Type-I
estimation, but we also consider connections to the other types.
Figure 1: Prediction of observable variables and estimations of latent variables.
The observable data are $\{x_1, \ldots, x_n\}$. Solid and dotted nodes are observable and
unobservable, respectively. Gray nodes are estimation targets.

3 Formal Definitions of Estimation Methods and 
Accuracy Evaluations 

This section presents the maximum likelihood and Bayes methods for estimating 
latent variables and the corresponding error functions. Here, we consider only the 
Type-I estimation problem for the joint probability of the hidden part. The other 
types will be defined and discussed in Section 5. 

Let $q(x,y) = q(y)q(x|y)$ be a joint probability of observable variables $x \in R^M$
and latent variables $y \in \{1, 2, \ldots, K^*\}$. This definition indicates that both x and y
are random variables and that a causal relation exists between them. In the case of
a discrete x such that $x \in \{1, 2, \ldots, M\}$, all the results in this paper hold if $\int dx$ is
replaced with $\sum_{x=1}^{M}$. The probability of the observable data x is expressed as

$$q(x) = \sum_{y=1}^{K^*} q(y)\, q(x|y).$$

We will refer to q{x, y) as the true model. 

We assume that the true probabilistic distribution with respect to x satisfies the 
minimality condition: the range of values of the latent variables denoted by K* is the 
minimum that is required to express $q(x)$. For example, consider a three-component
model, where $q(x|y=1) \ne q(x|y=2) = q(x|y=3)$ for all $x \in R^M$. The minimality
condition requires the two-component expression

$$q(x) = q(y=1)q(x|y=1) + \{q(y=2) + q(y=3)\}\, q(x|y=2)$$
$$\phantom{q(x)} = q(y=1)q(x|y=1) + q(y=\bar{2})\, q(x|y=\bar{2}),$$

where $\bar{2} = \{2, 3\}$. The minimum $K^*$ creates a one-to-one relationship between the
model expression and the function $q(x)$. If x is binary in the example, such as
$x \in \{0, 1\}$, the two-component expression is redundant for describing the probabilities
$q(x=0)$ and $q(x=1) = 1 - q(x=0)$. The model can be simplified to a one-component
expression:

$$q(x) = q(x|y=\bar{1}),$$

where $\bar{1} = \{1, 2\}$. We find that there is no need to define the latent variables in
this case, which means that the minimality condition reduces the redundancy of the
latent variable space.
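
The pooling step in the three-component example can be sketched in a few lines. The sketch below assumes a discrete x with three values and hypothetical probabilities; the function names are illustrative and not from the paper. It only merges labels whose conditional distributions coincide, so it does not cover the binary-x reduction to a single component.

```python
def merge_identical_components(prior, cond, tol=1e-12):
    """Pool latent labels whose conditional distributions q(x|y) coincide.

    prior[y] = q(y); cond[y][x] = q(x|y) for a discrete x.
    Returns the reduced (prior, cond) pair.
    """
    merged_prior, merged_cond = [], []
    for p_y, row in zip(prior, cond):
        for k, kept in enumerate(merged_cond):
            if all(abs(u - v) <= tol for u, v in zip(row, kept)):
                merged_prior[k] += p_y      # pool the mixing weights
                break
        else:
            merged_prior.append(p_y)
            merged_cond.append(list(row))
    return merged_prior, merged_cond


def marginal(prior, cond):
    """q(x) = sum_y q(y) q(x|y)."""
    return [sum(p * row[x] for p, row in zip(prior, cond))
            for x in range(len(cond[0]))]


# Hypothetical three-component model with q(x|y=2) = q(x|y=3).
prior = [0.5, 0.2, 0.3]
cond = [[0.7, 0.2, 0.1],
        [0.1, 0.3, 0.6],
        [0.1, 0.3, 0.6]]
min_prior, min_cond = merge_identical_components(prior, cond)
q_before = marginal(prior, cond)
q_after = marginal(min_prior, min_cond)
```

The reduced model has two labels and the same q(x), which is the sense in which the two-component expression is the minimal one here.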

The notation for the data sets is $(X^n, Y^n) := \{(x_1, y_1), \ldots, (x_n, y_n)\}$,
$X^n = \{x_1, \ldots, x_n\}$ and $Y^n = \{y_1, \ldots, y_n\}$. The joint probability distribution of
$(X^n, Y^n)$ is denoted by $q(X^n, Y^n) = \prod_{i=1}^{n} q(x_i, y_i)$.

Let $p(x,y|w) = p(y|w)p(x|y,w)$ be a learning model, where w is the parameter
and its dimension is d. Because the latent variable is unobservable, the learning
model generally has its own range of variables. Then, the probability of the observable
data is expressed as

$$p(x|w) = \sum_{y=1}^{K} p(y|w)\, p(x|y,w).$$

Assume that the learning model can attain the true model, i.e., there exists a set of
parameters $W_t$ such that

$$W_t = \{w^* \mid p(x,y|w^*) = q(x,y)\}.$$
The present paper focuses on the case $K = K^*$, where $W_t$ consists of the unique
point $w^*$, referred to as the true parameter. Note that there are no symmetric true
parameters in $W_t$, because the definition is based on the joint probability distribution
with respect to x and y.

We introduce two ways to construct a probability distribution of $Y^n$ based on
the observable $X^n$. First, we define an estimation method based on the maximum
likelihood estimator. The likelihood is defined by

$$L_X(w) = \prod_{i=1}^{n} p(x_i|w).$$

The maximum likelihood estimator $\hat{w}_X$ is given by

$$\hat{w}_X = \arg\max_{w} L_X(w).$$

Then, the estimated probability distribution of the latent variables is defined by

$$p(Y^n|X^n) = \frac{p(X^n, Y^n|\hat{w}_X)}{\sum_{Y^n} p(X^n, Y^n|\hat{w}_X)} = \prod_{i=1}^{n} p(y_i|x_i, \hat{w}_X). \qquad (1)$$

The notation $p(Y^n|X^n, \hat{w}_X)$ is used when the method is emphasized.
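
As a minimal sketch of the plug-in estimate in Eq. 1, the code below evaluates the product of per-observation label posteriors for a two-component Gaussian mixture with unit variances. The model, the data, and the value of `w_hat` are assumptions chosen for illustration; in practice `w_hat` would come from maximizing $L_X(w)$.

```python
import itertools
import math


def normal_pdf(x, mean):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def joint(x, y, w):
    """p(x, y | w) for w = (a, b1, b2) and labels y in {1, 2}."""
    a, b1, b2 = w
    return a * normal_pdf(x, b1) if y == 1 else (1.0 - a) * normal_pdf(x, b2)


def label_posterior(x, y, w):
    """p(y | x, w) = p(x, y | w) / p(x | w)."""
    return joint(x, y, w) / (joint(x, 1, w) + joint(x, 2, w))


def type1_ml_estimate(xs, ys, w_hat):
    """Eq. 1: p(Y^n | X^n) = prod_i p(y_i | x_i, w_hat)."""
    prob = 1.0
    for x, y in zip(xs, ys):
        prob *= label_posterior(x, y, w_hat)
    return prob


xs = [-0.3, 0.1, 2.2]          # hypothetical observations
w_hat = (0.5, 0.0, 2.0)        # assumed plug-in estimate, not fitted here
total = sum(type1_ml_estimate(xs, ys, w_hat)
            for ys in itertools.product((1, 2), repeat=len(xs)))
```

Summing the estimate over all label sequences gives one, which confirms that the product form is a proper distribution over $Y^n$.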

Next, we define the Bayesian estimation. Let the likelihood of the joint probability
distribution be

$$L_{XY}(w) = \prod_{i=1}^{n} p(x_i, y_i|w).$$

The marginal likelihood functions are given by

$$Z(X^n, Y^n) = \int L_{XY}(w)\, \varphi(w; \eta)\, dw,$$
$$Z(X^n) = \sum_{Y^n} Z(X^n, Y^n) = \int L_X(w)\, \varphi(w; \eta)\, dw,$$

where $\varphi(w; \eta)$ is a prior with the hyperparameter $\eta$. Then, the probability of $Y^n$ is
expressed as

$$p(Y^n|X^n) = \frac{Z(X^n, Y^n)}{Z(X^n)}. \qquad (2)$$
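
The ratio in Eq. 2 can be illustrated with a toy computation. The sketch below makes several assumptions that are not in the paper: the component means are known, only the mixing ratio a is unknown, and the prior is replaced by uniform weights on a finite grid, so that the integrals become sums.

```python
import itertools
import math

MEANS = (0.0, 2.0)                        # assumed known component means
GRID = [k / 20 for k in range(1, 20)]     # assumed support of the prior on a
PRIOR = [1.0 / len(GRID)] * len(GRID)     # uniform weights in place of the prior density


def normal_pdf(x, mean):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def joint(x, y, a):
    """p(x, y | a) with labels y in {1, 2}."""
    weight = a if y == 1 else 1.0 - a
    return weight * normal_pdf(x, MEANS[y - 1])


def z_xy(xs, ys):
    """Z(X^n, Y^n): prior-weighted sum of L_XY over the grid."""
    return sum(p * math.prod(joint(x, y, a) for x, y in zip(xs, ys))
               for a, p in zip(GRID, PRIOR))


def z_x(xs):
    """Z(X^n): prior-weighted sum of L_X over the grid."""
    return sum(p * math.prod(joint(x, 1, a) + joint(x, 2, a) for x in xs)
               for a, p in zip(GRID, PRIOR))


def bayes_estimate(xs, ys):
    """Eq. 2: p(Y^n | X^n) = Z(X^n, Y^n) / Z(X^n)."""
    return z_xy(xs, ys) / z_x(xs)


xs = [-0.4, 0.3, 1.1, 2.5]                # hypothetical observations
all_labels = list(itertools.product((1, 2), repeat=len(xs)))
total = sum(bayes_estimate(xs, ys) for ys in all_labels)
z_sum = sum(z_xy(xs, ys) for ys in all_labels)
```

The enumeration over all label sequences is only feasible for very small n; it is used here to check the identity $Z(X^n) = \sum_{Y^n} Z(X^n, Y^n)$ directly.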
The distribution of $Y^n$ in the true model is uniquely expressed as

$$q(Y^n|X^n) = \frac{q(X^n, Y^n)}{q(X^n)} = \prod_{i=1}^{n} \frac{q(x_i, y_i)}{q(x_i)},$$

where $q(x_i) = \sum_{y_i=1}^{K^*} q(x_i, y_i)$. Accuracy of the latent variable estimation is
measured by the difference between the true distribution $q(Y^n|X^n)$ and the estimated
one $p(Y^n|X^n)$. For the present paper, we define the error function as the average
Kullback-Leibler divergence,

$$D(n) = \frac{1}{n} E_{X^n}\left[ \sum_{Y^n} q(Y^n|X^n) \ln \frac{q(Y^n|X^n)}{p(Y^n|X^n)} \right], \qquad (3)$$

where the expectation is

$$E_{X^n}[f(X^n)] = \int f(X^n)\, q(X^n)\, dX^n.$$

Note that this function is available for any construction of $p(Y^n|X^n)$ when we
consider the cases of the maximum likelihood and the Bayes methods below.

4 Asymptotic Analysis of the Error Function

In this section we present and prove the main theorems for the asymptotic forms of
the error function.

4.1 Asymptotic Errors of the Two Methods

Let us define the following Fisher information matrices:

$$\{I_{XY}(w)\}_{ij} = E\left[ \frac{\partial \ln p(x,y|w)}{\partial w_i} \frac{\partial \ln p(x,y|w)}{\partial w_j} \right],$$
$$\{I_X(w)\}_{ij} = E\left[ \frac{\partial \ln p(x|w)}{\partial w_i} \frac{\partial \ln p(x|w)}{\partial w_j} \right],$$

where the expectation is

$$E[f(x,y)] = \int \sum_{y=1}^{K} f(x,y)\, p(x,y|w)\, dx.$$
Theorem 1 In the latent variable estimation given by Eq.1, the error function Eq.3
has the following asymptotic form:

$$D(n) = \frac{1}{2n} \mathrm{Tr}\left[ \{I_{XY}(w^*) - I_X(w^*)\}\, I_X^{-1}(w^*) \right] + o\left(\frac{1}{n}\right).$$

Theorem 2 In the latent variable estimation given by Eq.2, the error function Eq.3
has the following asymptotic form:

$$D(n) = \frac{1}{2n} \ln\det\left[ I_{XY}(w^*)\, I_X^{-1}(w^*) \right] + o\left(\frac{1}{n}\right).$$

These theorems reveal the speed of the decrease of the error function when the
training data size n becomes large. The dominant order is 1/n in both methods,
and its coefficient depends on the Fisher information matrices. We will present a
more detailed discussion on the coefficient in Section 5.

The following corollary shows the advantage of the Bayes estimation.

Corollary 3 Let the error functions for the maximum likelihood and the Bayes
methods be denoted by $D^{ML}(n)$ and $D^{Bayes}(n)$, respectively. Assume that
$I_{XY}(w^*) \ne I_X(w^*)$. For any true parameter $w^*$, there exists a positive constant c such that

$$D^{ML}(n) - D^{Bayes}(n) \ge \frac{c}{n} + o\left(\frac{1}{n}\right).$$

This result shows that $D^{ML}(n) > D^{Bayes}(n)$ for a sufficiently large data size n.
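
To make the two coefficients concrete, the following sketch evaluates them for a one-parameter model (d = 1). The model is an assumption for illustration only: a two-component Gaussian mixture with known means 0 and b, unit variances, and an unknown mixing ratio a. In that case $I_{XY}(a) = 1/\{a(1-a)\}$, and $I_X(a)$ is obtained here by numerical integration.

```python
import math


def normal_pdf(x, mean):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def fisher_xy(a):
    """I_XY(a): with known components, only the Bernoulli(a) label depends on a."""
    return 1.0 / (a * (1.0 - a))


def fisher_x(a, b, pad=12.0, steps=20000):
    """I_X(a) = int (N(x;0,1) - N(x;b,1))^2 / p(x|a) dx, by the trapezoid rule."""
    lo, hi = min(0.0, b) - pad, max(0.0, b) + pad
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        x = lo + k * h
        n1, n2 = normal_pdf(x, 0.0), normal_pdf(x, b)
        term = (n1 - n2) ** 2 / (a * n1 + (1.0 - a) * n2)
        total += term if 0 < k < steps else 0.5 * term
    return total * h


a_true, b = 0.3, 2.0                        # illustrative true parameter
i_xy, i_x = fisher_xy(a_true), fisher_x(a_true, b)
ml_coef = 0.5 * (i_xy - i_x) / i_x          # coefficient of 1/n in Theorem 1
bayes_coef = 0.5 * math.log(i_xy / i_x)     # coefficient of 1/n in Theorem 2
```

With a scalar parameter the trace and the determinant both reduce to the ratio $I_{XY}/I_X$, so the gap between the two methods is the gap between $r - 1$ and $\ln r$ for $r > 1$.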



4.2 Proof of Theorem 1 

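First, let us define another Fisher information matrix:

$$\{I_{Y|X}(w)\}_{ij} = E\left[ \frac{\partial \ln p(y|x,w)}{\partial w_i} \frac{\partial \ln p(y|x,w)}{\partial w_j} \right].$$

Based on $p(y|x,w) = p(x,y|w)/p(x|w)$,

$$I_{Y|X}(w) = I_{XY}(w) + I_X(w) - J_{XY}(w) - J_{XY}^{\top}(w),$$

where

$$\{J_{XY}(w)\}_{ij} = E\left[ \frac{\partial \ln p(x,y|w)}{\partial w_i} \frac{\partial \ln p(x|w)}{\partial w_j} \right].$$

According to the definition, we obtain

$$\{J_{XY}(w)\}_{ij} = E\left[ \frac{1}{p(x,y|w)} \frac{\partial p(x,y|w)}{\partial w_i} \frac{\partial \ln p(x|w)}{\partial w_j} \right]$$
$$= \int \sum_{y=1}^{K} \frac{\partial p(x,y|w)}{\partial w_i} \frac{\partial \ln p(x|w)}{\partial w_j}\, dx$$
$$= \int \frac{\partial p(x|w)}{\partial w_i} \frac{\partial \ln p(x|w)}{\partial w_j}\, dx$$
$$= \int \frac{\partial \ln p(x|w)}{\partial w_i} \frac{\partial \ln p(x|w)}{\partial w_j}\, p(x|w)\, dx = \{I_X(w)\}_{ij}.$$

Thus, it holds that

$$I_{Y|X}(w) = I_{XY}(w) - I_X(w). \qquad (4)$$

Next, let us divide the error function into three parts:

$$D(n) = D_1(n) - D_2(n) - D_3(n), \qquad (5)$$
$$D_1(n) = \frac{1}{n} E_{X^n Y^n}\left[ \ln q(X^n, Y^n) \right],$$
$$D_2(n) = \frac{1}{n} E_{X^n Y^n}\left[ \ln p(X^n, Y^n|\hat{w}_X) \right],$$
$$D_3(n) = \frac{1}{n} E_{X^n}\left[ \ln \frac{q(X^n)}{p(X^n|\hat{w}_X)} \right],$$

where the expectation is

$$E_{X^n Y^n}[f(X^n, Y^n)] = \int \sum_{Y^n} f(X^n, Y^n)\, q(X^n, Y^n)\, dX^n.$$

Because $D_3(n)$ is the training error on $p(x|\hat{w}_X)$, the asymptotic form is known
(Akaike, 1974):

$$D_3(n) = -\frac{d}{2n} + o\left(\frac{1}{n}\right).$$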
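Let another estimator be defined by

$$\hat{w}_{XY} = \arg\max_{w} L_{XY}(w).$$

According to the Taylor expansion, $D_2(n)$ can be rewritten as

$$D_2(n) = \frac{1}{n} E_{X^n Y^n}\left[ \sum_{i=1}^{n} \ln p(x_i, y_i|\hat{w}_{XY}) \right]
+ \frac{1}{n} E_{X^n Y^n}\left[ \delta w^{\top} \sum_{i=1}^{n} \frac{\partial \ln p(x_i, y_i|\hat{w}_{XY})}{\partial w} \right]$$
$$\quad + \frac{1}{2n} E_{X^n Y^n}\left[ \delta w^{\top} \sum_{i=1}^{n} \frac{\partial^2 \ln p(x_i, y_i|\hat{w}_{XY})}{\partial w^2}\, \delta w \right] + \frac{1}{n} R_1(\delta w)$$
$$= \frac{1}{n} E_{X^n Y^n}\left[ \sum_{i=1}^{n} \ln p(x_i, y_i|\hat{w}_{XY}) \right]
- \frac{1}{2} E_{X^n Y^n}\left[ \delta w^{\top} I_{XY}(w^*)\, \delta w \right] + o\left(\frac{1}{n}\right),$$

where $\delta w = \hat{w}_X - \hat{w}_{XY}$, and $R_1(\delta w)$ is the remainder term. The first-order term
vanishes because $\hat{w}_{XY}$ maximizes $L_{XY}(w)$. The matrix
$\frac{1}{n} \sum_{i=1}^{n} \partial^2 \ln p(x_i, y_i|\hat{w}_{XY}) / \partial w^2$ was replaced with $-I_{XY}(w^*)$ on the basis of the
law of large numbers. As for the first term of $D_2$,

$$D_1(n) - \frac{1}{n} E_{X^n Y^n}\left[ \sum_{i=1}^{n} \ln p(x_i, y_i|\hat{w}_{XY}) \right] = -\frac{d}{2n} + o\left(\frac{1}{n}\right),$$

because it is the training error on $p(x, y|\hat{w}_{XY})$. The factor in the second term of $D_2$
can be rewritten as

$$E_{X^n Y^n}\left[ \delta w^{\top} I_{XY}(w^*)\, \delta w \right]
= E_{X^n Y^n}\left[ (\hat{w}_X - w^*)^{\top} I_{XY}(w^*) (\hat{w}_X - w^*) \right]$$
$$\quad - E_{X^n Y^n}\left[ (\hat{w}_{XY} - w^*)^{\top} I_{XY}(w^*) (\hat{w}_X - w^*) \right]$$
$$\quad - E_{X^n Y^n}\left[ (\hat{w}_X - w^*)^{\top} I_{XY}(w^*) (\hat{w}_{XY} - w^*) \right]$$
$$\quad + E_{X^n Y^n}\left[ (\hat{w}_{XY} - w^*)^{\top} I_{XY}(w^*) (\hat{w}_{XY} - w^*) \right]. \qquad (6)$$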
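Let us define an extended likelihood function,

$$L_2(w_{12}) = \sum_{i=1}^{n} \ln p(x_i, y_i|w_1) + \sum_{i=1}^{n} \ln p(x_i|w_2),$$

where $w_{12} = (w_1^{\top}, w_2^{\top})^{\top}$, $\hat{w}_{12} = (\hat{w}_{XY}^{\top}, \hat{w}_X^{\top})^{\top}$, and $w^{**} = (w^{*\top}, w^{*\top})^{\top}$ are extended
vectors. According to the Taylor expansion to the first order,

$$\frac{\partial L_2(\hat{w}_{12})}{\partial w_{12}} = \left( \frac{\partial \sum_i \ln p(x_i, y_i|w^*)}{\partial w_1}^{\top}, \frac{\partial \sum_i \ln p(x_i|w^*)}{\partial w_2}^{\top} \right)^{\top} - M\, \delta w_{12},$$
$$\delta w_{12} = \hat{w}_{12} - w^{**},$$
$$M = -\begin{pmatrix} \dfrac{\partial^2 \sum_i \ln p(x_i, y_i|w^*)}{\partial w_1^2} & 0 \\ 0 & \dfrac{\partial^2 \sum_i \ln p(x_i|w^*)}{\partial w_2^2} \end{pmatrix}.$$

According to $\partial L_2(\hat{w}_{12}) / \partial w_{12} = 0$, $\delta w_{12} = \hat{w}_{12} - w^{**}$ can be written as

$$\delta w_{12} = M^{-1} \left( \frac{\partial \sum_i \ln p(x_i, y_i|w^*)}{\partial w_1}^{\top}, \frac{\partial \sum_i \ln p(x_i|w^*)}{\partial w_2}^{\top} \right)^{\top}.$$

Based on the central limit theorem, $\delta w_{12}$ is distributed from $\mathcal{N}(0, nM^{-1} \Sigma M^{-1})$,
where

$$\Sigma = \begin{pmatrix} I_{XY}(w^*) & J_{XY}(w^*) \\ J_{XY}^{\top}(w^*) & I_X(w^*) \end{pmatrix}.$$

The covariance $nM^{-1} \Sigma M^{-1}$ of $\delta w_{12}$ directly shows the covariance of the estimators
$\hat{w}_X$ and $\hat{w}_{XY}$ in Eq.6. Thus it holds that

$$E_{X^n Y^n}\left[ \delta w^{\top} I_{XY}(w^*)\, \delta w \right]
= \frac{1}{n} \left\{ \mathrm{Tr}\left[ I_{XY}(w^*) I_X^{-1}(w^*) \right] - \mathrm{Tr}\left[ J_{XY}(w^*) I_X^{-1}(w^*) \right] \right.$$
$$\quad \left. - \mathrm{Tr}\left[ J_{XY}^{\top}(w^*) I_X^{-1}(w^*) \right] + \mathrm{Tr}\left[ I_{XY}(w^*) I_{XY}^{-1}(w^*) \right] \right\} + o\left(\frac{1}{n}\right)$$
$$= \frac{1}{n} \left\{ \mathrm{Tr}\left[ I_{XY}(w^*) I_X^{-1}(w^*) \right] - d \right\} + o\left(\frac{1}{n}\right),$$

where the last equality uses $J_{XY}(w^*) = I_X(w^*)$. Considering the relation Eq.5, we obtain that

$$D(n) = \frac{1}{2n} \mathrm{Tr}\left[ I_{Y|X}(w^*)\, I_X^{-1}(w^*) \right] + o\left(\frac{1}{n}\right).$$

Based on Eq.4, the theorem is proved. (End of Proof)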



4.3 Proof of Theorem 2 

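Let us define the following entropy functions:

$$S_{XY} = -\sum_{y=1}^{K^*} \int q(x,y) \ln q(x,y)\, dx,$$
$$S_X = -\int q(x) \ln q(x)\, dx.$$

According to the definition, the error function Eq.3 with the Bayes estimation can
be rewritten as

$$D(n) = \frac{1}{n} \left\{ F_{XY}(n) - F_X(n) \right\},$$

where

$$F_{XY}(n) = -nS_{XY} - E_{X^n Y^n}\left[ \ln Z(X^n, Y^n) \right],$$
$$F_X(n) = -nS_X - E_{X^n}\left[ \ln Z(X^n) \right].$$

Based on the Taylor expansion at $w = \hat{w}_X$,

$$F_X(n) = -nS_X - E_{X^n}\left[ \ln \int \exp\left\{ \ln p(X^n|\hat{w}_X)
+ \frac{1}{2} (w - \hat{w}_X)^{\top} \frac{\partial^2 \ln p(X^n|\hat{w}_X)}{\partial w^2} (w - \hat{w}_X) + r_1(w) \right\} \varphi(w; \eta)\, dw \right]$$
$$= -nS_X - E_{X^n}\left[ \ln p(X^n|\hat{w}_X) \right]
- E_{X^n}\left[ \ln \int e^{r_1(w)} \varphi(w; \eta)\, \mathcal{N}(\hat{w}_X, \Sigma_1/n)\, dw \right]
- E_{X^n}\left[ \ln \sqrt{2\pi}^{\,d} \sqrt{\det(\Sigma_1/n)} \right],$$

where $r_1(w)$ is the remainder term, $\mathcal{N}(\mu, \Sigma)$ is the Gaussian density with mean $\mu$ and
covariance $\Sigma$, and

$$\Sigma_1^{-1} = -\frac{1}{n} \frac{\partial^2 \ln p(X^n|\hat{w}_X)}{\partial w^2},$$

which converges to $I_X(w^*)$ based on the law of large numbers. Again, applying the
expansion at $w = w^*$ to $e^{r_1(w)} \varphi(w; \eta)$, we obtain

$$F_X(n) = E_{X^n}\left[ \ln \frac{q(X^n)}{p(X^n|\hat{w}_X)} \right]
- E_{X^n}\left[ \ln \int \left\{ e^{r_1(w^*)} \varphi(w^*; \eta)
+ (w - w^*)^{\top} \frac{\partial e^{r_1(w^*)} \varphi(w^*; \eta)}{\partial w} + r_2(w) \right\}
\mathcal{N}(\hat{w}_X, \{nI_X(w^*)\}^{-1})\, dw \right]$$
$$\quad - \ln \sqrt{2\pi}^{\,d} \sqrt{\det\{nI_X(w^*)\}^{-1}} + o(1),$$

where $r_2(w)$ is the remainder term. The first term is the training error on $p(x|\hat{w}_X)$.
According to (Akaike, 1974), it holds that

$$E_{X^n}\left[ \ln \frac{q(X^n)}{p(X^n|\hat{w}_X)} \right] = -\frac{d}{2} + o(1).$$

Then, we obtain

$$F_X(n) = \frac{d}{2} \ln \frac{n}{2\pi e} + \ln \frac{\sqrt{\det I_X(w^*)}}{\varphi(w^*; \eta)} + o(1),$$

which is consistent with the result of (Clarke & Barron, 1990). By replacing $X^n$
with $(X^n, Y^n)$,

$$F_{XY}(n) = \frac{d}{2} \ln \frac{n}{2\pi e} + \ln \frac{\sqrt{\det I_{XY}(w^*)}}{\varphi(w^*; \eta)} + o(1).$$

Therefore,

$$D(n) = \frac{1}{2n} \left\{ \ln\det I_{XY}(w^*) - \ln\det I_X(w^*) \right\} + o\left(\frac{1}{n}\right),$$

which proves the theorem. (End of Proof)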
4.4 Proof of Corollary 3 

Because $I_{XY}(w)$ is symmetric positive definite, we have a decomposition $I_{XY}(w) = LL^{\top}$,
where L is a lower triangular matrix. The other Fisher information matrix
$I_X(w)$ is also symmetric positive definite. Thus, $L^{\top} I_X^{-1}(w) L$ is positive definite.
Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d > 0$ be the eigenvalues of $L^{\top} I_X^{-1}(w) L$. According to the
assumption, at least one eigenvalue is different from one. Then, we obtain

$$2n\{D^{ML}(n) - D^{Bayes}(n)\} = \mathrm{Tr}[I_{XY}(w) I_X^{-1}(w)] - d - \ln\det[I_{XY}(w) I_X^{-1}(w)] + o(1)$$
$$= \mathrm{Tr}[L^{\top} I_X^{-1}(w) L] - d - \ln\det[L^{\top} I_X^{-1}(w) L] + o(1)$$
$$= \sum_{i=1}^{d} \{\lambda_i - 1\} - \ln \prod_{i=1}^{d} \lambda_i + o(1)$$
$$= \sum_{i=1}^{d} \{\lambda_i - 1 - \ln \lambda_i\} + o(1).$$

The first term in the last expression is positive, because $\lambda - 1 - \ln \lambda \ge 0$ with equality
only at $\lambda = 1$, which proves the corollary. (End of Proof)
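
The last expression of the proof is easy to evaluate for given eigenvalues. The helper below is a small sketch with hypothetical eigenvalues; the function name is illustrative.

```python
import math


def ml_minus_bayes(eigenvalues):
    """Leading term of 2n{D^ML(n) - D^Bayes(n)}: sum_i (lambda_i - 1 - ln lambda_i)."""
    return sum(lam - 1.0 - math.log(lam) for lam in eigenvalues)


gap_equal = ml_minus_bayes([1.0, 1.0, 1.0])   # the case I_XY = I_X, excluded by the assumption
gap_mixed = ml_minus_bayes([3.0, 1.5, 1.0])   # at least one eigenvalue differs from one
```

Each summand is non-negative and vanishes only at an eigenvalue of one, which is why a single eigenvalue away from one is enough for a strictly positive gap.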

5 Discussion 

5.1 Symmetry of the Learning Results 

In hierarchical models, there are symmetries in both the parameter space and the 
latent variables. We consider the following simple case to observe them. 
Example 4 Let $q(x)$ and $p(x|w)$ be Gaussian mixtures that have two components,

$$q(x) = a^* \mathcal{N}(x; 0, 1^2) + (1 - a^*) \mathcal{N}(x; b^*, 1^2),$$
$$p(x|w) = a \mathcal{N}(x; b_1, 1^2) + (1 - a) \mathcal{N}(x; b_2, 1^2),$$

where $\mathcal{N}(x; \mu, \sigma^2)$ is a one-dimensional Gaussian distribution, and $a^*$ and $a$ are
mixing ratios. The minimality condition requires that $b^* \ne 0$. The parameter of the
learning model is $w = (a, b_1, b_2)$. We assume that

$$q(x, y=1) = a^* \mathcal{N}(x; 0, 1^2),$$
$$q(x, y=2) = (1 - a^*) \mathcal{N}(x; b^*, 1^2),$$
$$p(x, y=1|w) = a \mathcal{N}(x; b_1, 1^2),$$
$$p(x, y=2|w) = (1 - a) \mathcal{N}(x; b_2, 1^2).$$

The learning model has two parameter points at which to express the true model:
$w_{t1} = (a^*, 0, b^*)$ and $w_{t2} = (1 - a^*, b^*, 0)$. This is known as label switching. According
to the definition of $W_t$, the true parameter is the unique point $w^* = (a^*, 0, b^*)$. This
implies that the former expression is accepted as the proper estimation of y and the
latter one, which exchanges the components, is not. This is reflected in the definition
of the error function (Eq.3). The order of the components strictly eliminates the
symmetries. We refer to this restriction in the error as the asymmetric constraint.

Let us investigate the relation between the asymmetric constraint and the error
value for both methods. For a sufficiently large amount of training data, the likelihood
function in Example 4 has two peaks that share the same value and are in the
neighborhood of the points $w_{t1}$ and $w_{t2}$. This is because there is no information on
the component label from the observable data $X^n$. Convergence of the maximum
likelihood estimator $\hat{w}_X$ thus depends on the initial point. Theorem 1 shows the
asymptotic error for the convergence to $w^* = w_{t1}$. Due to improper labeling, the
estimation of $w_{t2}$ will have a bias term in the asymptotic error, i.e., the error does
not converge to zero. Therefore, Theorem 1 indicates the best performance that can
be obtained by the maximum likelihood estimator, but at the same time, it indicates
that the method will not always achieve this.
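
The two equal peaks can be checked numerically. The sketch below uses the model of Example 4 with assumed values $a^* = 0.3$ and $b^* = 2$ and simulated data; it compares the log-likelihood and the label posterior at $w_{t1}$ and $w_{t2}$.

```python
import math
import random


def normal_pdf(x, mean):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def log_likelihood(xs, w):
    """ln L_X(w) for w = (a, b1, b2)."""
    a, b1, b2 = w
    return sum(math.log(a * normal_pdf(x, b1) + (1.0 - a) * normal_pdf(x, b2))
               for x in xs)


def posterior_label1(x, w):
    """p(y = 1 | x, w)."""
    a, b1, b2 = w
    first = a * normal_pdf(x, b1)
    return first / (first + (1.0 - a) * normal_pdf(x, b2))


a_true, b_true = 0.3, 2.0                  # assumed values of a* and b*
rng = random.Random(0)
xs = [rng.gauss(0.0 if rng.random() < a_true else b_true, 1.0)
      for _ in range(200)]

w_t1 = (a_true, 0.0, b_true)
w_t2 = (1.0 - a_true, b_true, 0.0)
ll_1, ll_2 = log_likelihood(xs, w_t1), log_likelihood(xs, w_t2)
# Under w_t2 the posterior of label 1 equals the posterior of label 2 under w_t1.
swap_gap = max(abs(posterior_label1(x, w_t1) + posterior_label1(x, w_t2) - 1.0)
               for x in xs)
```

The likelihood cannot distinguish the two points, while the label posteriors are exchanged, which is the source of the bias term when the estimator converges near $w_{t2}$.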

The Bayes estimation also has symmetries both in the latent variables and the
parameter spaces. Due to the symmetry of the parameter space, the estimated
distribution in Example 4 satisfies $p(Y^n|X^n) = p(\bar{Y}^n|X^n)$, where $\bar{Y}^n$ means that
labels 1 and 2 in $Y^n$ are swapped for each other. The symmetry may adversely affect
the error; the estimation result $p(Y^n|X^n)$ with parameter marginalization takes
account of the symmetry, while $q(Y^n|X^n)$ does not. To more precisely investigate
this effect, we eliminate the parametric symmetry and derive the asymptotic error.
According to the component parameters, we divide the parameter space into two
regions, such that $W_1 = \{(a, b_1, b_2) : b_1 < b_2\}$ and $W_2 = \{(a, b_1, b_2) : b_1 > b_2\}$. Assume that
$b^* > 0$, where $w_{t1}$ belongs to $W_1$ and $w_{t2}$ belongs to $W_2$. Let us define the distribution
of $Y^n$ as

$$p_j(Y^n|X^n) = \frac{\int_{W_j} L_{XY}(w)\, \varphi(w; \eta)\, dw}{\int_{W_j} L_X(w)\, \varphi(w; \eta)\, dw}, \quad j = 1, 2,$$

and the variants of the true model as

$$q_1(x, y) = p(x, y|w_{t1}) = q(x, y),$$
$$q_2(x, y) = p(x, y|w_{t2}).$$

We define an error function as follows:

$$D_{sym}(n) = \frac{1}{n} \min_{j=1,2} E_{X^n Y^n}\left[ \ln \frac{q_j(Y^n|X^n)}{p_j(Y^n|X^n)} \right],$$

which has the true parameter in each symmetric region. We can easily obtain that
$D_{sym}(n)$ is asymptotically equivalent to $D(n)$ even in the general case. Therefore,
we conclude that parameter symmetry does not adversely affect the error value in
the Bayes method.



5.2 Relation to Other Error Functions 

We now formulate the predictions of observable data and the remaining estimations 
for Types II and III, and we consider the relations of their error functions to that 
of Type I. 

First, we compare the error function to the generalization error, which measures
the prediction performance on unseen observable data. The generalization error is
defined as

$$D_x(n) = E_{X^n}\left[ \int q(x) \ln \frac{q(x)}{p(x|X^n)}\, dx \right],$$

where x is independent of $X^n$ in the data-generating process of $q(x)$. The predictive
distribution $p(x|X^n)$ is constructed by

$$p(x|X^n) = p(x|\hat{w}_X)$$

for the maximum likelihood method and

$$p(x|X^n) = \int p(x|w)\, p(w|X^n)\, dw$$

for the Bayes method. The posterior distribution of the Bayes method is given by

$$p(w|X^n) = \frac{L_X(w)\, \varphi(w; \eta)}{Z(X^n)}.$$

Both methods have the same dominant terms in their asymptotic forms,

$$D_x(n) = \frac{d}{2n} + o\left(\frac{1}{n}\right).$$

The coefficient of the asymptotic generalization error depends only on the dimension
of the parameter for any model, but that of $D(n)$ is determined by both the model
expression and the true parameter $w^*$.

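Next, we discuss Type-II estimation; we focus on the value $y_i$ from $Y^n$ and its
estimation accuracy. Based on the joint probability, the estimation of $y_i$ is defined
by

$$p(y_i|X^n) = \sum_{Y^n \setminus y_i} p(Y^n|X^n),$$

where the summation is taken over $Y^n$ except for $y_i$. Thus the error function depends
on which $y_i$ we exclude. In order to measure the average effect of the exclusions, we
define the error as follows:

$$D_{y|X^n}(n) = E_{X^n}\left[ \frac{1}{n} \sum_{i=1}^{n} \sum_{y_i=1}^{K^*} q(y_i|x_i) \ln \frac{q(y_i|x_i)}{p(y_i|X^n)} \right].$$

The maximum likelihood method has the following estimation,

$$p(y_i|X^n) = \sum_{Y^n \setminus y_i} \prod_{j=1}^{n} \frac{p(x_j, y_j|\hat{w}_X)}{p(x_j|\hat{w}_X)}$$
$$= \frac{p(x_1|\hat{w}_X) \cdots p(x_{i-1}|\hat{w}_X)\, p(x_i, y_i|\hat{w}_X)\, p(x_{i+1}|\hat{w}_X) \cdots p(x_n|\hat{w}_X)}{\prod_{j=1}^{n} p(x_j|\hat{w}_X)}$$
$$= \frac{p(x_i, y_i|\hat{w}_X)}{p(x_i|\hat{w}_X)} = p(y_i|x_i, \hat{w}_X).$$

We can easily find that

$$D_{y|X^n}(n) = E_{X^n}\left[ \frac{1}{n} \sum_{i=1}^{n} \sum_{y_i=1}^{K^*} q(y_i|x_i) \ln \frac{q(y_i|x_i)}{p(y_i|x_i, \hat{w}_X)} \right]
= \frac{1}{n} E_{X^n}\left[ \sum_{Y^n} q(Y^n|X^n) \ln \frac{q(Y^n|X^n)}{p(Y^n|X^n, \hat{w}_X)} \right].$$

Therefore, it holds that $D_{y|X^n}(n) = D(n)$ in the maximum likelihood method. However,
the Bayes method has the estimation

$$p(y_i|X^n) = \frac{\int p(x_1|w) \cdots p(x_{i-1}|w)\, p(x_i, y_i|w)\, p(x_{i+1}|w) \cdots p(x_n|w)\, \varphi(w; \eta)\, dw}{\int L_X(w)\, \varphi(w; \eta)\, dw},$$

which indicates $D_{y|X^n}(n) \ne D(n)$. A sufficient condition for $D_{y|X^n}(n) = D(n)$ is to
satisfy $p(Y^n|X^n) = \prod_{i=1}^{n} p(y_i|X^n)$.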
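
The marginalization in the maximum likelihood case can be verified by brute force on a small example. The sketch below assumes the two-component unit-variance mixture used for illustration, with a hypothetical plug-in parameter; it sums the factorized joint estimate over all label sequences with a fixed $y_i$ and compares the result with $p(y_i|x_i, \hat{w}_X)$.

```python
import itertools
import math


def normal_pdf(x, mean):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def label_posterior(x, y, w):
    """p(y | x, w) for w = (a, b1, b2) and labels y in {1, 2}."""
    a, b1, b2 = w
    first = a * normal_pdf(x, b1)
    second = (1.0 - a) * normal_pdf(x, b2)
    return (first if y == 1 else second) / (first + second)


def joint_estimate(xs, ys, w):
    """Type-I plug-in estimate: prod_j p(y_j | x_j, w)."""
    return math.prod(label_posterior(x, y, w) for x, y in zip(xs, ys))


def marginal_estimate(xs, i, y_i, w):
    """Type-II estimate of y_i: sum the joint estimate over all other labels."""
    return sum(joint_estimate(xs, ys, w)
               for ys in itertools.product((1, 2), repeat=len(xs))
               if ys[i] == y_i)


xs = [-0.5, 0.8, 1.9, 2.4]      # hypothetical observations
w_hat = (0.4, 0.0, 2.0)         # assumed plug-in estimate
lhs = marginal_estimate(xs, 1, 2, w_hat)
rhs = label_posterior(xs[1], 2, w_hat)
```

Because the plug-in estimate factorizes over i, the other factors sum to one and drop out. The Bayes estimate does not factorize in general, so the same shortcut is not available there.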

Finally, we consider the Type-III estimation. The error is defined by

$$D_{y|x}(n) = E_{X^n}\left[ \int q(x) \sum_{y=1}^{K^*} q(y|x) \ln \frac{q(y|x)}{p(y|x, X^n)}\, dx \right].$$

Note that the new observation x is not used for estimation of y, or $D_{y|x}(n)$ will be
equivalent to the Type-II error $D_{y|X^{n+1}}(n+1)$. The maximum likelihood estimation
$p(y|x, X^n)$ is given by

$$p(y|x, X^n) = \frac{p(x, y|\hat{w}_X)}{p(x|\hat{w}_X)},$$

and for the Bayes method it is

$$p(y|x, X^n) = \int \frac{p(x, y|w)}{p(x|w)}\, p(w|X^n)\, dw. \qquad (7)$$

Using the result in (Shimodaira, 1993) for a variant Akaike information criterion
(AIC) from partially observed data, we immediately obtain the asymptotic form of
$D_{y|x}(n)$ as

$$D_{y|x}(n) = \frac{1}{2n} \mathrm{Tr}\left[ \{I_{XY}(w^*) - I_X(w^*)\}\, I_X^{-1}(w^*) \right] + o\left(\frac{1}{n}\right).$$

We thus conclude that all estimation types have the same accuracy in the maximum
likelihood method. The difference of the training data between Types II and III
does not asymptotically affect the estimation results. The analysis of the Type-III
estimate in the Bayes method is left for future study.



5.3 Variants of Types II and III 

Table 1 summarizes the results in the previous subsection. The rows indicate the
maximum likelihood (ML) and the Bayes methods, respectively. The Fisher information
matrices $I_{XY}(w^*)$ and $I_X(w^*)$ are abbreviated in a form that does not include
the true parameter, i.e., $I_{XY}$ and $I_X$. The error functions of Types II and III in the
Bayes method are still unknown. The analysis is not straightforward when there
is a single target of estimation, because the asymptotic expansion is not available
when the number of target nodes is constant with respect to the training data size
n.

Table 1: Coefficients of the dominant order 1/n in the error functions

         Prediction   Type I                                  Type II                                 Type III
ML       d/2          $\mathrm{Tr}[\{I_{XY} - I_X\} I_X^{-1}]/2$   $\mathrm{Tr}[\{I_{XY} - I_X\} I_X^{-1}]/2$   $\mathrm{Tr}[\{I_{XY} - I_X\} I_X^{-1}]/2$
Bayes    d/2          $\ln\det[I_{XY} I_X^{-1}]/2$            unknown                                 unknown

Figure 2: (Left) Partial marginal estimation for $y_1, \ldots, y_{\alpha n}$ (Type II'). (Right) Estimation for
future data $y_{n+1}, \ldots, y_{n+\alpha n}$ (Type III').

Consider the variants of Types II and III depicted in Figure 2. Assume that
$0 < \alpha \le 1$ is a constant rational number and that n gets large enough to satisfy that
$\alpha n$ is an integer. The left panel shows the partial marginal estimation referred to as
Type II'. We will consider the joint probability of $y_1, \ldots, y_{\alpha n}$, where the remaining
variables $y_{\alpha n+1}, \ldots, y_n$ have been marginalized out. Type II' is equivalent to Type I
when $\alpha = 1$. Note that the order in which the target nodes are determined does not
change the average accuracy for i.i.d. data. The right panel indicates the estimations
for future data $y_{n+1}, \ldots, y_{n+\alpha n}$. We refer to it as Type III' and construct the joint
probability on these variables. In the variant types, the targets are changed from a
single node to $\alpha n$ nodes, which enables us to analyze the asymptotic behavior.

We will use the following notation:

$$Y_1 = \{y_1, \ldots, y_{\alpha n}\}$$

for Type II' and

$$X_2 = \{x_{n+1}, \ldots, x_{n+\alpha n}\}, \quad Y_2 = \{y_{n+1}, \ldots, y_{n+\alpha n}\}$$

for Type III'. The Bayes estimations are given by

$$p(Y_1|X^n) = \int \prod_{i=1}^{\alpha n} \frac{p(x_i, y_i|w)}{p(x_i|w)}\, p(w|X^n)\, dw,$$
$$p(Y_2|X_2, X^n) = \int \prod_{i=n+1}^{n+\alpha n} \frac{p(x_i, y_i|w)}{p(x_i|w)}\, p(w|X^n)\, dw$$

for Type II' and Type III', respectively. The respective error functions are defined
by

$$D_{Y_1|X^n}(n) = \frac{1}{\alpha n} E_{X^n}\left[ \sum_{Y_1} q(Y_1|X^n) \ln \frac{q(Y_1|X^n)}{p(Y_1|X^n)} \right],$$
$$D_{Y_2|X_2}(n) = \frac{1}{\alpha n} E_{X^n, X_2}\left[ \sum_{Y_2} q(Y_2|X_2) \ln \frac{q(Y_2|X_2)}{p(Y_2|X_2, X^n)} \right].$$

In ways similar to the proofs of Theorems 1 and 2, the asymptotic forms are derived
as follows.

Theorem 5 In Type II', the error function has the following asymptotic form:

$$D_{Y_1|X^n}(n) = \frac{1}{2\alpha n} \ln\det\left[ K_{XY}(w^*)\, I_X^{-1}(w^*) \right] + o\left(\frac{1}{n}\right),$$

where $K_{XY}(w) = \alpha I_{XY}(w) + (1 - \alpha) I_X(w)$.
The proof is in the appendix.

Theorem 6 In Type III', the error function has the following asymptotic form:

$$D_{Y_2|X_2}(n) = \frac{1}{2\alpha n} \ln\det\left[ K_{XY}(w^*)\, I_X^{-1}(w^*) \right] + o\left(\frac{1}{n}\right).$$

This proof is also in the appendix. These theorems show that when Types II' and
III' have the same $\alpha$, they asymptotically have the same accuracy. This implies the
asymptotic equivalency of Types II and III by combining the results of the maximum
likelihood method.

Table 2 summarizes the results. Based on the definitions, the results for the
maximum likelihood method are also available for Types II' and III'. Using the
asymptotic forms, we can compare the relation of the magnitudes for the maximum
likelihood method.
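
The dependence of the Bayes coefficient on $\alpha$ can be sketched through the eigenvalues $\lambda_i$ of $I_{XY} I_X^{-1}$, since $K_{XY} I_X^{-1} = \alpha I_{XY} I_X^{-1} + (1 - \alpha) I$ has eigenvalues $\alpha \lambda_i + 1 - \alpha$. The eigenvalues below are hypothetical and the function names are illustrative.

```python
import math


def bayes_partial_coefficient(eigenvalues, alpha):
    """Coefficient of 1/n in Theorems 5 and 6: ln det[K_XY I_X^{-1}] / (2 alpha)."""
    return sum(math.log(alpha * lam + 1.0 - alpha)
               for lam in eigenvalues) / (2.0 * alpha)


def ml_coefficient(eigenvalues):
    """Coefficient of 1/n for the ML method: Tr[{I_XY - I_X} I_X^{-1}] / 2."""
    return 0.5 * sum(lam - 1.0 for lam in eigenvalues)


eigs = [3.0, 1.5, 1.0]                        # hypothetical eigenvalues of I_XY I_X^{-1}
coef_full = bayes_partial_coefficient(eigs, 1.0)
coef_half = bayes_partial_coefficient(eigs, 0.5)
coef_quarter = bayes_partial_coefficient(eigs, 0.25)
coef_ml = ml_coefficient(eigs)
```

At $\alpha = 1$ the value coincides with the Type-I coefficient of Theorem 2, and for these eigenvalues the coefficient grows as $\alpha$ shrinks while staying below the maximum likelihood coefficient.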
Table 2: Coefficients of the dominant order 1/n in the error functions

         Pred.   Type I                                  Type II'                                Type III'
ML       d/2     $\mathrm{Tr}[\{I_{XY} - I_X\} I_X^{-1}]/2$   $\mathrm{Tr}[\{I_{XY} - I_X\} I_X^{-1}]/2$   $\mathrm{Tr}[\{I_{XY} - I_X\} I_X^{-1}]/2$
Bayes    d/2     $\ln\det[I_{XY} I_X^{-1}]/2$            $\ln\det[K_{XY} I_X^{-1}]/(2\alpha)$    $\ln\det[K_{XY} I_X^{-1}]/(2\alpha)$

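Corollary 7 Assume that $I_{XY}(w) \ne I_X(w)$. For $0 < \alpha \le 1$, there exists a positive
constant $c_1$ such that

$$\mathrm{Tr}\left[ \{I_{XY}(w) - I_X(w)\}\, I_X^{-1}(w) \right] - \frac{1}{\alpha} \ln\det\left[ K_{XY}(w)\, I_X^{-1}(w) \right] \ge c_1 + o\left(\frac{1}{n}\right).$$

The proof is in the appendix. We immediately obtain the following relation, which
shows the advantage of the Bayes estimation in the asymptotic case:

$$D^{ML}_{Y_1|X^n}(n) \ge D^{Bayes}_{Y_1|X^n}(n), \quad D^{ML}_{Y_2|X_2}(n) \ge D^{Bayes}_{Y_2|X_2}(n)$$

for respective $\alpha$'s.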

By comparing the errors of Types I and II' in the Bayes method, we can obtain
the effect of supplementary observable data. Let us consider the Type-II' case in
which the estimation target is $Y_1$ and the training data is only $X_1 = \{x_1, \ldots, x_{\alpha n}\}$.
This corresponds to the estimation in Type I with $\alpha n$ training data, which we
emphasize by calling it Type I'. The difference between Type I' and Type II' is the
addition of supplementary data $\{x_{\alpha n+1}, \ldots, x_n\}$.

Corollary 8 Assume that the minimum eigenvalue of $I_{XY}(w^*) I_X^{-1}(w^*)$ is not less
than one, i.e., $\lambda_d \ge 1$. The error difference is asymptotically described as

$$D(\alpha n) - D_{Y_1|X^n}(n) = \frac{1}{2\alpha n} \ln\det\left[ I_{XY}(w^*)\, K_{XY}^{-1}(w^*) \right] + o\left(\frac{1}{n}\right)
\ge \frac{c_2}{n} + o\left(\frac{1}{n}\right),$$

where $c_2$ is a positive constant. This shows that Type II' has a smaller error than
Type I' in the asymptotic situation; the supplementary data make the estimation
more accurate.

The proof is in the appendix.



5.4 Comparison between the Two Methods 

Corollaries 3 and 7 show that the Bayes method is more accurate than the maximum
likelihood method for Types I, II', and III'. There have been many data-based
comparisons of the prediction performance of these two methods (e.g., (Akaike,
1980; ?; ?)). We will now discuss the computational costs of the two methods for
the estimation of latent variables. We note there will be a trade-off between cost 
and accuracy. 

We will assume that the estimated distribution is to be calculated for a practical
purpose. For example, the value of $p(Y^n|X^n)$ in Type I is used for sampling label
assignments and for searching for the optimal assignment. The
maximum likelihood method requires the determination of $\hat{w}_X$ for all Types I, II,
and III. The computation is not expensive once $\hat{w}_X$ is successfully found, but the
global maximum point of the likelihood function is not easily obtained. The EM
algorithm is commonly used for searching for the maximum likelihood estimator in
models with latent variables, but it is often trapped in one of the local maxima. The
results of the steepest descent method also depend on the initial point and the step
size of the iteration.
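
As an illustration of this dependence on the initial point, the following sketch runs
the EM algorithm from several random starting points and keeps the run with the
largest likelihood. The model (a two-component Gaussian mixture on the real line with
unit variances), the function names, and the restart heuristic are assumptions made
for this example; they are not part of the analysis in this paper.

```python
import math
import random

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def normal_pdf(x, mean):
    """Density of N(mean, 1) at x (unit variance is an assumption of this toy model)."""
    return math.exp(-0.5 * (x - mean) ** 2 - LOG_SQRT_2PI)


def responsibilities(data, weight, mean1, mean2):
    """Estimated label distribution p(y = 1 | x, w) for each observation."""
    result = []
    for x in data:
        first = weight * normal_pdf(x, mean1)
        second = (1.0 - weight) * normal_pdf(x, mean2)
        result.append(first / (first + second))
    return result


def log_likelihood(data, weight, mean1, mean2):
    """Log-likelihood of the observable data only."""
    return sum(
        math.log(weight * normal_pdf(x, mean1) + (1.0 - weight) * normal_pdf(x, mean2))
        for x in data
    )


def em(data, weight, mean1, mean2, iterations=100):
    """Run EM from one initial point; return the parameters and the likelihood trace."""
    trace = []
    for _ in range(iterations):
        resp = responsibilities(data, weight, mean1, mean2)  # E-step
        total = sum(resp)
        weight = total / len(data)  # M-step
        mean1 = sum(r * x for r, x in zip(resp, data)) / total
        mean2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / (len(data) - total)
        trace.append(log_likelihood(data, weight, mean1, mean2))
    return (weight, mean1, mean2), trace


def fit_with_restarts(data, restarts=5, seed=0):
    """Keep the EM run with the largest final log-likelihood (a common heuristic)."""
    rng = random.Random(seed)
    low, high = min(data), max(data)
    runs = [
        em(data, 0.5, rng.uniform(low, high), rng.uniform(low, high))
        for _ in range(restarts)
    ]
    return max(runs, key=lambda run: run[1][-1]), runs


rng = random.Random(1)
data = [rng.gauss(-2.0, 1.0) for _ in range(50)] + [rng.gauss(2.0, 1.0) for _ in range(50)]
(best_params, best_trace), runs = fit_with_restarts(data)
label_probs = responsibilities(data, *best_params)  # p(y | x, estimated parameter)
```

Once the estimator is fixed, the label probabilities are obtained in a single pass over
the data, which is the inexpensive part; the restarts are where the cost goes.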

The Bayes method is generally expensive. In the estimated distribution
$p(Y^n|X^n)$ of Type I, the numerator $Z(X^n,Y^n)$ contains integrals that depend on
$Y^n$. Sampling $y_i$ in Type II requires the same computation as for Type I: we can
obtain $y_i$ by ignoring the other elements $Y^n \setminus y_i$, which realizes the
marginalization over them. A conjugate prior allows us to have a tractable form of
$Z(X^n,Y^n)$ (Dawid & Lauritzen, 1993; Heckerman, 1999), which reduces the computational
cost. In Type III, Eq.7 shows that there is no direct sampling method for $y$. In this
case, expensive sampling from the posterior $p(w|X^n)$ is necessary.
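
To make the role of the conjugate prior concrete, the sketch below computes
$p(Y^n|X^n) = Z(X^n,Y^n)/Z(X^n)$ by brute-force enumeration for a toy model. The model
is an assumption chosen for illustration: a two-component mixture whose component
densities are known unit-variance Gaussians, whose only parameter is the mixing weight,
and whose prior is a symmetric Beta distribution with a hypothetical hyperparameter
`phi`. All function names are likewise hypothetical.

```python
import itertools
import math


def log_normal_pdf(x, mean):
    """Log-density of N(mean, 1) at x."""
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_joint_marginal(xs, ys, means=(-2.0, 2.0), phi=1.0):
    """ln Z(X^n, Y^n): the complete-data likelihood integrated over the
    Beta(phi, phi) prior on the mixing weight (closed form by conjugacy)."""
    n_first = sum(1 for y in ys if y == 0)
    n_second = len(ys) - n_first
    log_z = log_beta(n_first + phi, n_second + phi) - log_beta(phi, phi)
    return log_z + sum(log_normal_pdf(x, means[y]) for x, y in zip(xs, ys))


def label_posterior(xs):
    """p(Y^n | X^n) for every assignment; Z(X^n) is the sum over all assignments."""
    assignments = list(itertools.product((0, 1), repeat=len(xs)))
    weights = [math.exp(log_joint_marginal(xs, ys)) for ys in assignments]
    evidence = sum(weights)
    return {ys: w / evidence for ys, w in zip(assignments, weights)}


def marginal_of_label(posterior, index, value):
    """Type II style marginal p(y_index = value | X^n), summing out the other labels."""
    return sum(p for ys, p in posterior.items() if ys[index] == value)


xs = [-2.3, -1.7, 0.1, 1.9, 2.4]
posterior = label_posterior(xs)
best_assignment = max(posterior, key=posterior.get)
p_first = marginal_of_label(posterior, 0, 0)
```

The enumeration visits $2^n$ assignments and is feasible only for very small $n$, which is
one way to see why the Bayes method is expensive even when $Z(X^n,Y^n)$ itself is tractable.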

The VB method is an approximation that allows the direct computation of
$p(Y^n|X^n)$ and $p(w|X^n)$, which have tractable forms and reduced computational
costs. However, the assumption that $p(Y^n|X^n)$ and $p(w|X^n)$ are independent does
not hold in many cases. We conjecture that the $p(Y^n|X^n)$ of the VB method will
be less accurate than that of the original Bayes method.

6 Conclusions 

In the present paper we formalized the estimation of the distribution of the latent
variables from the observable data, and we measured its accuracy by using the
Kullback-Leibler divergence. We succeeded in deriving the asymptotic error functions
for both the maximum likelihood and the Bayes methods. These results allow
us to mathematically compare the estimation methods: we determined that the
Bayes method is more accurate than the maximum likelihood method in most cases,
while their prediction accuracies are equivalent. The generalization error has been
approximated from the given observable data, such as by using the cross-validation
and bootstrap methods, but there is no approximation technique for the error of
the estimation of the latent variables, because the latent data cannot be obtained.
Therefore, these asymptotic forms are thus far the only way we have to estimate
this accuracy.

Appendix

In this section, we prove Theorem 5, Theorem 6, Corollary 7, and Corollary 8. 

Proof of Theorem 5 

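The error function is rewritten in terms of

$$F^{(1)}_{XY}(n) = -\alpha n S_{XY} - (1-\alpha) n S_X
  - E_{X^n,Y_1}\left[\ln \int L^{(1)}_{XY}(w)\varphi(w;\eta)\,dw\right],$$

$$L^{(1)}_{XY}(w) = \prod_{j=1}^{\alpha n} p(x_j,y_j|w) \prod_{i=\alpha n+1}^{n} p(x_i|w).$$

Based on the Taylor expansion at $w = \hat{w}^{(1)}$, where
$\hat{w}^{(1)} = \arg\max_w L^{(1)}_{XY}(w)$,

$$F^{(1)}_{XY}(n) = E_{X^n,Y_1}\left[\sum_{j=1}^{\alpha n}\ln\frac{q(x_j,y_j)}{p(x_j,y_j|\hat{w}^{(1)})}
  + \sum_{i=\alpha n+1}^{n}\ln\frac{q(x_i)}{p(x_i|\hat{w}^{(1)})}\right]$$
$$\qquad - E_{X^n,Y_1}\left[\ln\int\exp\left\{-\frac{n}{2}(w-\hat{w}^{(1)})^{\top}
  G^{(1)}(X^n,Y_1)(w-\hat{w}^{(1)}) + r_1(w)\right\}\varphi(w;\eta)\,dw\right],$$

where $r_1(w)$ is the remainder term and

$$G^{(1)}(X^n,Y_1) = -\frac{1}{n}\frac{\partial^2}{\partial w^2}\left\{\sum_{j=1}^{\alpha n}\ln p(x_j,y_j|w)
  + \sum_{i=\alpha n+1}^{n}\ln p(x_i|w)\right\}\Bigg|_{w=\hat{w}^{(1)}}.$$

The first and the second terms of $F^{(1)}_{XY}(n)$ correspond to the training error. Following
the same method as we used in the proof of Theorem 2 and noting that

$$G^{(1)}(X^n,Y_1) \to K_{XY}(w^*),$$

we obtain

$$F^{(1)}_{XY}(n) = \frac{d}{2}\ln\frac{n}{2\pi e}
  + \ln\frac{\sqrt{\det K_{XY}(w^*)}}{\varphi(w^*;\eta)} + o(1),$$

which completes the proof. (End of Proof)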
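
The expansion step used above (a second-order Taylor expansion around the maximizer,
followed by a Gaussian integral) can be checked numerically on a model where the
integral is known in closed form. The sketch below is only an illustration under
assumptions that are not from this paper: a Bernoulli model with a uniform prior on
$(0,1)$, so that $d = 1$ and the prior density equals one; the function names are
hypothetical.

```python
import math


def exact_log_marginal(successes, n):
    """ln of the integral of w^k (1 - w)^(n - k) over a uniform prior on (0, 1)."""
    failures = n - successes
    return math.lgamma(successes + 1) + math.lgamma(failures + 1) - math.lgamma(n + 2)


def laplace_log_marginal(successes, n):
    """Second-order Taylor expansion of the log-likelihood around its maximizer,
    followed by the resulting Gaussian integral (d = 1, prior density equal to 1)."""
    w_hat = successes / n
    log_lik_at_max = successes * math.log(w_hat) + (n - successes) * math.log(1.0 - w_hat)
    curvature = 1.0 / (w_hat * (1.0 - w_hat))  # -(1/n) d^2/dw^2 ln L at w_hat
    return log_lik_at_max - 0.5 * math.log(n / (2.0 * math.pi)) - 0.5 * math.log(curvature)


exact = exact_log_marginal(300, 1000)
approx = laplace_log_marginal(300, 1000)
gap = abs(exact - approx)  # shrinks as n grows
```

The term $-\frac{1}{2}\ln\frac{n}{2\pi}$ in the approximation is the one-dimensional
counterpart of the $\frac{d}{2}\ln\frac{n}{2\pi}$ term in the free energy, up to sign.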

Proof of Theorem 6 

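The error function is rewritten in terms of

$$F^{(2)}_{XY}(n) = -\alpha n S_{XY} - n S_X
  - E_{X^n,X_2,Y_2}\left[\ln \int L^{(2)}_{XY}(w)\varphi(w;\eta)\,dw\right],$$

$$L^{(2)}_{XY}(w) = \prod_{j=n+1}^{n+\alpha n} p(y_j|x_j,w) \prod_{i=1}^{n} p(x_i|w).$$

Based on the Taylor expansion at $w = \hat{w}^{(2)}$, where
$\hat{w}^{(2)} = \arg\max_w L^{(2)}_{XY}(w)$,

$$F^{(2)}_{XY}(n) = E_{X^n,X_2,Y_2}\left[\sum_{j=n+1}^{n+\alpha n}\ln\frac{q(y_j|x_j)}{p(y_j|x_j,\hat{w}^{(2)})}
  + \sum_{i=1}^{n}\ln\frac{q(x_i)}{p(x_i|\hat{w}^{(2)})}\right]$$
$$\qquad - E_{X^n,X_2,Y_2}\left[\ln\int\exp\left\{-\frac{n}{2}(w-\hat{w}^{(2)})^{\top}
  G^{(2)}(X^n,X_2,Y_2)(w-\hat{w}^{(2)}) + r_2(w)\right\}\varphi(w;\eta)\,dw\right],$$

where $r_2(w)$ is the remainder term and

$$G^{(2)}(X^n,X_2,Y_2) = -\frac{1}{n}\frac{\partial^2}{\partial w^2}\left\{\sum_{j=n+1}^{n+\alpha n}\ln p(y_j|x_j,w)
  + \sum_{i=1}^{n}\ln p(x_i|w)\right\}\Bigg|_{w=\hat{w}^{(2)}}.$$

The first and the second terms of $F^{(2)}_{XY}(n)$ correspond to the training error, which
are stated as

$$E\left[\sum_{j=n+1}^{n+\alpha n}\ln\frac{q(y_j|x_j)}{p(y_j|x_j,\hat{w}^{(2)})}\right]
  + E\left[\sum_{i=1}^{n}\ln\frac{q(x_i)}{p(x_i|\hat{w}^{(2)})}\right]
  = -\frac{1}{2}\mathrm{Tr}\left[\{\alpha I_{Y|X}(w^*) + I_X(w^*)\}K_{XY}(w^*)^{-1}\right] + o(1).$$

Following the same method we used in the proof of Theorem 2 and noting that

$$G^{(2)}(X^n,X_2,Y_2) \to K_{XY}(w^*),$$

we obtain

$$F^{(2)}_{XY}(n) = -\frac{1}{2}\mathrm{Tr}\left[\{\alpha I_{Y|X}(w^*) + I_X(w^*)\}K_{XY}(w^*)^{-1}\right]
  + \frac{d}{2}\ln\frac{n}{2\pi} + \ln\frac{\sqrt{\det K_{XY}(w^*)}}{\varphi(w^*;\eta)} + o(1)$$
$$\qquad = \frac{d}{2}\ln\frac{n}{2\pi e} + \ln\frac{\sqrt{\det K_{XY}(w^*)}}{\varphi(w^*;\eta)} + o(1),$$

which completes the proof. (End of Proof)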

Proof of Corollary 7 

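It holds that

$$\frac{1}{\alpha}\ln\det\left[K_{XY}(w)I_X^{-1}(w)\right]
  = \frac{1}{\alpha}\ln\det\left[\alpha\{I_{XY}(w) - I_X(w)\}I_X^{-1}(w) + E_d\right],$$

where $E_d$ is the $d \times d$ unit matrix. On the other hand,

$$\mathrm{Tr}\left[\{I_{XY}(w) - I_X(w)\}I_X^{-1}(w)\right]
  = \frac{1}{\alpha}\left\{\mathrm{Tr}\left[\alpha\{I_{XY}(w) - I_X(w)\}I_X^{-1}(w) + E_d\right] - d\right\}.$$

It is easy to confirm that $\alpha L^{\top} I_X^{-1}(w) L + E_d$ is positive definite, where
$L^{\top} L = I_{XY}(w) - I_X(w)$. Considering the eigenvalues
$\mu_1 \geq \mu_2 \geq \cdots \geq \mu_d > 0$, we can
obtain the following relation in the same way as we did in the proof of Corollary 3:

$$\mathrm{Tr}\left[\{I_{XY}(w) - I_X(w)\}I_X^{-1}(w)\right]
  - \frac{1}{\alpha}\ln\det\left[K_{XY}(w)I_X^{-1}(w)\right]
  = \frac{1}{\alpha}\sum_{i=1}^{d}\left(\mu_i - 1 - \ln\mu_i\right).$$

It is easy to confirm that the right-hand side is positive, which completes the proof.
(End of Proof)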
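
The relation can be verified directly in the scalar case $d = 1$, where the single
eigenvalue is $\mu = \alpha(I_{XY} - I_X)/I_X + 1$. The following sketch is a numerical
check only; the function names and the numerical values of the Fisher information are
hypothetical and chosen for illustration.

```python
import math


def lhs_scalar(i_xy, i_x, alpha):
    """Tr[(I_XY - I_X) I_X^{-1}] - (1/alpha) ln det[K_XY I_X^{-1}] for d = 1,
    with K_XY = alpha * I_XY + (1 - alpha) * I_X."""
    k_xy = alpha * i_xy + (1.0 - alpha) * i_x
    return (i_xy - i_x) / i_x - math.log(k_xy / i_x) / alpha


def rhs_from_eigenvalues(mus, alpha):
    """(1/alpha) * sum_i (mu_i - 1 - ln mu_i)."""
    return sum(mu - 1.0 - math.log(mu) for mu in mus) / alpha


i_xy, i_x, alpha = 3.0, 1.5, 0.4
mu = alpha * (i_xy - i_x) / i_x + 1.0
difference = lhs_scalar(i_xy, i_x, alpha)
```

Each summand $\mu - 1 - \ln\mu$ is non-negative and vanishes only at $\mu = 1$, which
corresponds to $I_{XY} = I_X$ in this scalar example.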

Proof of Corollary 8 

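Based on the eigenvalues of $I_{XY}(w^*)I_X^{-1}(w^*)$, it holds that

$$\ln\det\left[I_{XY}(w^*)K_{XY}^{-1}(w^*)\right]
  = \ln\det\left[I_{XY}(w^*)I_X^{-1}(w^*)\right]
  - \ln\det\left[\alpha I_{XY}(w^*)I_X^{-1}(w^*) + (1-\alpha)E_d\right]$$
$$\qquad = \sum_{i=1}^{d}\ln\lambda_i - \sum_{i=1}^{d}\ln\{\alpha\lambda_i + (1-\alpha)\} > 0,$$

which completes the proof. (End of Proof)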
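
The sign of the eigenvalue expression is easy to examine numerically. In the sketch
below, the function name and the eigenvalues are hypothetical; the second call uses an
eigenvalue below one purely to show what the assumption of Corollary 8 rules out.

```python
import math


def log_det_ratio(lambdas, alpha):
    """sum_i [ln(lambda_i) - ln(alpha * lambda_i + 1 - alpha)], i.e.
    ln det[I_XY K_XY^{-1}] written with the eigenvalues of I_XY I_X^{-1}."""
    return sum(math.log(lam) - math.log(alpha * lam + (1.0 - alpha)) for lam in lambdas)


with_assumption = log_det_ratio([1.0, 2.0, 5.0], alpha=0.5)  # all eigenvalues >= 1
without_assumption = log_det_ratio([0.5], alpha=0.5)         # an eigenvalue below 1
```

Each term is zero when $\lambda_i = 1$ and strictly positive when $\lambda_i > 1$, because
$\lambda_i > \alpha\lambda_i + (1-\alpha)$ is equivalent to $\lambda_i > 1$ for $0 < \alpha < 1$.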